This notebook will teach you to use PyTorch low-level core. You can install it here. For high-level interface see the next notebook.
PyTorch feels differently than tensorflow/theano on almost every level. TensorFlow makes your code live in two "worlds" simultaneously: symbolic graphs and actual tensors. First you declare a symbolic "recipe" of how to get from inputs to outputs, then feed it with actual minibatches of data. In PyTorch, there's only one world: all tensors have a numeric value.
You compute outputs on the fly without pre-declaring anything. The code looks exactly as in pure numpy with one exception: PyTorch computes gradients for you. And can run stuff on GPU. And has a number of pre-implemented building blocks for your neural nets. And a few more things.
And now we finally shut up and let PyTorch do the talking.
In [ ]:
import sys
if 'google.colab' in sys.modules and not os.path.exists('.setup_complete'):
!wget -q https://raw.githubusercontent.com/yandexdataschool/Practical_RL/spring20/week04_%5Brecap%5D_deep_learning/notmnist.py
!touch .setup_complete
In [ ]:
import numpy as np
import torch
print(torch.__version__)
In [ ]:
# numpy world
x = np.arange(16).reshape(4, 4)
print("X :\n%s\n" % x)
print("X.shape : %s\n" % (x.shape,))
print("add 5 :\n%s\n" % (x + 5))
print("X*X^T :\n%s\n" % np.dot(x, x.T))
print("mean over cols :\n%s\n" % (x.mean(axis=-1)))
print("cumsum of cols :\n%s\n" % (np.cumsum(x, axis=0)))
In [ ]:
# PyTorch world
x = np.arange(16).reshape(4, 4)
x = torch.tensor(x, dtype=torch.float32) # or torch.arange(0,16).view(4,4)
print("X :\n%s" % x)
print("X.shape : %s\n" % (x.shape,))
print("add 5 :\n%s" % (x + 5))
print("X*X^T :\n%s" % torch.matmul(x, x.transpose(1, 0))) # short: x.mm(x.t())
print("mean over cols :\n%s" % torch.mean(x, dim=-1))
print("cumsum of cols :\n%s" % torch.cumsum(x, dim=0))
As you can notice, PyTorch allows you to hack stuff much the same way you did with numpy. No graph declaration, no placeholders, no sessions. This means that you can see the numeric value of any tensor at any moment of time. Debugging such code can be done with by printing tensors or using any debug tool you want (e.g. gdb).
You could also notice the a few new method names and a different API. So no, there's no compatibility with numpy yet and yes, you'll have to memorize all the names again. Get excited!
For example,
x.reshape([1,2,8]) -> x.view(1,2,8)
x.sum(axis=-1) -> x.sum(dim=-1)
x.astype('int64') -> x.type(torch.LongTensor)
To help you acclimatize, there's a table covering most new things. There's also a neat documentation page.
Finally, if you're stuck with a technical problem, we recommend searching PyTorch forumns. Or just googling, which usually works just as efficiently.
If you feel like you almost give up, remember two things: GPU an free gradients. Besides you can always jump back to numpy with x.numpy()
inspired by this post
There are some simple mathematical functions with cool plots. For one, consider this:
$$ x(t) = t - 1.5 * cos( 15 t) $$$$ y(t) = t - 1.5 * sin( 16 t) $$
In [ ]:
import matplotlib.pyplot as plt
%matplotlib inline
t = torch.linspace(-10, 10, steps=10000)
# compute x(t) and y(t) as defined above
x = <YOUR CODE>
y = <YOUR CODE>
plt.plot(x.numpy(), y.numpy())
if you're done early, try adjusting the formula and seing how it affects the function
Any self-respecting DL framework must do your backprop for you. Torch handles this with the autograd
module.
The general pipeline looks like this:
requires_grad
:torch.zeros(5, requires_grad=True)
loss = arbitrary_function(a)
loss.backward()
a.grads
Here's an example: let's fit a linear regression on Boston house prices
In [ ]:
from sklearn.datasets import load_boston
boston = load_boston()
plt.scatter(boston.data[:, -1], boston.target)
Out[ ]:
In [ ]:
from torch.autograd import Variable
w = torch.zeros(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
x = torch.tensor(boston.data[:, -1] / 10, dtype=torch.float32)
y = torch.tensor(boston.target, dtype=torch.float32)
In [ ]:
y_pred = w * x + b
loss = torch.mean((y_pred - y)**2)
# propagete gradients
loss.backward()
The gradients are now stored in .grad
of those variables that require them.
In [ ]:
print("dL/dw = \n", w.grad)
print("dL/db = \n", b.grad)
If you compute gradient from multiple losses, the gradients will add up at variables, therefore it's useful to zero the gradients between iteratons.
In [ ]:
from IPython.display import clear_output
for i in range(100):
y_pred = w * x + b
loss = torch.mean((y_pred - y)**2)
loss.backward()
w.data -= 0.05 * w.grad.data
b.data -= 0.05 * b.grad.data
# zero gradients
w.grad.data.zero_()
b.grad.data.zero_()
# the rest of code is just bells and whistles
if (i+1) % 5 == 0:
clear_output(True)
plt.scatter(x.data.numpy(), y.data.numpy())
plt.scatter(x.data.numpy(), y_pred.data.numpy(),
color='orange', linewidth=5)
plt.show()
print("loss = ", loss.data.numpy())
if loss.data.numpy() < 0.5:
print("Done!")
break
Bonus quest: try implementing and writing some nonlinear regression. You can try quadratic features or some trigonometry, or a simple neural network. The only difference is that now you have more variables and a more complicated y_pred
.
So far we've been dealing with low-level torch API. While it's absolutely vital for any custom losses or layers, building large neura nets in it is a bit clumsy.
Luckily, there's also a high-level torch interface with a pre-defined layers, activations and training algorithms.
We'll cover them as we go through a simple image recognition problem: classifying letters into "A" vs "B".
In [ ]:
from notmnist import load_notmnist
X_train, y_train, X_test, y_test = load_notmnist(letters='AB')
X_train, X_test = X_train.reshape([-1, 784]), X_test.reshape([-1, 784])
print("Train size = %i, test_size = %i" % (len(X_train), len(X_test)))
In [ ]:
for i in [0, 1]:
plt.subplot(1, 2, i + 1)
plt.imshow(X_train[i].reshape([28, 28]))
plt.title(str(y_train[i]))
Let's start with layers. The main abstraction here is torch.nn.Module
In [ ]:
from torch import nn
import torch.nn.functional as F
print(nn.Module.__doc__)
There's a vast library of popular layers and architectures already built for ya'.
This is a binary classification problem, so we'll train a Logistic Regression with sigmoid. $$P(y_i | X_i) = \sigma(W \cdot X_i + b) ={ 1 \over {1+e^{- [W \cdot X_i + b]}} }$$
In [ ]:
# create a network that stacks layers on top of each other
model = nn.Sequential()
# add first "dense" layer with 784 input units and 1 output unit.
model.add_module('l1', nn.Linear(784, 1))
# add softmax activation for probabilities. Normalize over axis 1
# note: layer names must be unique
model.add_module('l2', nn.Sigmoid())
In [ ]:
print("Weight shapes:", [w.shape for w in model.parameters()])
In [ ]:
# create dummy data with 3 samples and 784 features
x = torch.tensor(X_train[:3], dtype=torch.float32)
y = torch.tensor(y_train[:3], dtype=torch.float32)
# compute outputs given inputs, both are variables
y_predicted = model(x)[:, 0]
y_predicted # display what we've got
Out[ ]:
Let's now define a loss function for our model.
The natural choice is to use binary crossentropy (aka logloss, negative llh): $$ L = {1 \over N} \underset{X_i,y_i} \sum - [ y_i \cdot log P(y_i=1 | X_i) + (1-y_i) \cdot log (1-P(y_i=1 | X_i)) ]$$
In [ ]:
crossentropy = <YOUR CODE>
loss = <YOUR CODE>
assert tuple(crossentropy.size()) == (
3,), "Crossentropy must be a vector with element per sample"
assert tuple(loss.size()) == tuple(
), "Loss must be scalar. Did you forget the mean/sum?"
assert loss.data.numpy() > 0, "Crossentropy must non-negative, zero only for perfect prediction"
assert loss.data.numpy() <= np.log(
3), "Loss is too large even for untrained model. Please double-check it."
Note: you can also find many such functions in torch.nn.functional
, just type F.<tab>
.
Torch optimizers
When we trained Linear Regression above, we had to manually .zero_() gradients on both our variables. Imagine that code for a 50-layer network.
Again, to keep it from getting dirty, there's torch.optim
module with pre-implemented algorithms:
In [ ]:
opt = torch.optim.RMSprop(model.parameters(), lr=0.01)
# here's how it's used:
loss.backward() # add new gradients
opt.step() # change weights
opt.zero_grad() # clear gradients
In [ ]:
# dispose of old variables to avoid bugs later
del x, y, y_predicted, loss, y_pred
In [ ]:
# create network again just in case
model = nn.Sequential()
model.add_module('first', nn.Linear(784, 1))
model.add_module('second', nn.Sigmoid())
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
In [ ]:
history = []
for i in range(100):
# sample 256 random images
ix = np.random.randint(0, len(X_train), 256)
x_batch = torch.tensor(X_train[ix], dtype=torch.float32)
y_batch = torch.tensor(y_train[ix], dtype=torch.float32)
# predict probabilities
y_predicted = <YOUR CODE>
assert y_predicted.dim(
) == 1, "did you forget to select first column with [:, 0]"
# compute loss, just like before
loss = <YOUR CODE>
# compute gradients
<YOUR CODE>
# Adam step
<YOUR CODE>
# clear gradients
<YOUR CODE>
history.append(loss.data.numpy())
if i % 10 == 0:
print("step #%i | mean loss = %.3f" % (i, np.mean(history[-10:])))
Debugging tips:
In [ ]:
# use your model to predict classes (0 or 1) for all test samples
predicted_y_test = <YOUR CODE>
assert isinstance(predicted_y_test, np.ndarray), "please return np array, not %s" % type(
predicted_y_test)
assert predicted_y_test.shape == y_test.shape, "please predict one class for each test sample"
assert np.in1d(predicted_y_test, y_test).all(), "please predict class indexes"
accuracy = np.mean(predicted_y_test == y_test)
print("Test accuracy: %.5f" % accuracy)
assert accuracy > 0.95, "try training longer"
When dealing with more complex stuff like neural network, it's best if you use tensors the way samurai uses his sword.
1.1 the cannabola disclaimer
Let's write another function, this time in polar coordinates: $$\rho(\theta) = (1 + 0.9 \cdot cos (8 \cdot \theta) ) \cdot (1 + 0.1 \cdot cos(24 \cdot \theta)) \cdot (0.9 + 0.05 \cdot cos(200 \cdot \theta)) \cdot (1 + sin(\theta))$$
Then convert it into cartesian coordinates (howto) and plot the results.
Use torch tensors only: no lists, loops, numpy arrays, etc.
In [ ]:
theta = torch.linspace(- np.pi, np.pi, steps=1000)
# compute rho(theta) as per formula above
rho = <YOUR CODE>
# Now convert polar (rho, theta) pairs into cartesian (x,y) to plot them.
x = <YOUR CODE>
y = <YOUR CODE>
plt.figure(figsize=[6, 6])
plt.fill(x.numpy(), y.numpy(), color='green')
plt.grid()
Now it's time for you to make something more challenging. We'll implement Conway's Game of Life in pure PyTorch.
While this is still a toy task, implementing game of life this way has one cool benefit: you'll be able to run it on GPU! Indeed, what could be a better use of your gpu than simulating game of life on 1M/1M grids?
For this task, you are given a reference numpy implementation that you must convert to PyTorch. [numpy code inspired by: https://github.com/rougier/numpy-100]
Note: You can find convolution in torch.nn.functional.conv2d(Z,filters)
. Note that it has a different input format.
Note 2: From the mathematical standpoint, PyTorch convolution is actually cross-correlation. Those two are very similar operations. More info: video tutorial, scipy functions review, stack overflow source.
In [ ]:
from scipy.signal import correlate2d
def np_update(Z):
# Count neighbours with convolution
filters = np.array([[1, 1, 1],
[1, 0, 1],
[1, 1, 1]])
N = correlate2d(Z, filters, mode='same')
# Apply rules
birth = (N == 3) & (Z == 0)
survive = ((N == 2) | (N == 3)) & (Z == 1)
Z[:] = birth | survive
return Z
In [ ]:
def torch_update(Z):
"""
Implement an update function that does to Z exactly the same as np_update.
:param Z: torch.FloatTensor of shape [height,width] containing 0s(dead) an 1s(alive)
:returns: torch.FloatTensor Z after updates.
You can opt to create new tensor or change Z inplace.
"""
# <Your code here!>
return Z
In [ ]:
# initial frame
Z_numpy = np.random.choice([0, 1], p=(0.5, 0.5), size=(100, 100))
Z = torch.from_numpy(Z_numpy).type(torch.FloatTensor)
# your debug polygon :)
Z_new = torch_update(Z.clone())
# tests
Z_reference = np_update(Z_numpy.copy())
assert np.all(Z_new.numpy(
) == Z_reference), "your PyTorch implementation doesn't match np_update. Look into Z and np_update(ZZ) to investigate."
print("Well done!")
In [ ]:
%matplotlib notebook
plt.ion()
# initialize game field
Z = np.random.choice([0, 1], size=(100, 100))
Z = torch.from_numpy(Z).type(torch.FloatTensor)
fig = plt.figure()
ax = fig.add_subplot(111)
fig.show()
for _ in range(100):
# update
Z = torch_update(Z)
# re-draw image
ax.clear()
ax.imshow(Z.numpy(), cmap='gray')
fig.canvas.draw()
In [ ]:
# Some fun setups for your amusement
# parallel stripes
Z = np.arange(100) % 2 + np.zeros([100, 100])
# with a small imperfection
Z[48:52, 50] = 1
Z = torch.from_numpy(Z).type(torch.FloatTensor)
fig = plt.figure()
ax = fig.add_subplot(111)
fig.show()
for _ in range(100):
Z = torch_update(Z)
ax.clear()
ax.imshow(Z.numpy(), cmap='gray')
fig.canvas.draw()
More fun with Game of Life: video
This time you will solve the same digit recognition problem, but at a greater scale
We want you to build a network that reaches at least 80% accuracy and has at least 2 linear layers in it. Naturally, it should be nonlinear to beat logistic regression. You can implement it with either
With 10 classes you will need to use Softmax at the top instead of sigmoid and train for categorical crossentropy (see here). Write your own loss or use torch.nn.functional.nll_loss
. Just make sure you understand what it accepts as an input.
Note that you are not required to build 152-layer monsters here. A 2-layer (one hidden, one output) neural network should already give you an edge over logistic regression.
[bonus kudos] If you've already beaten logistic regression with a two-layer net, but enthusiasm still ain't gone, you can try improving the test accuracy even further! It should be possible to reach 90% without convnets.
SPOILERS!
At the end of the notebook you will find a few tips and frequent errors.
If you feel confident enogh, just start coding right away and get there if once you need to untangle yourself.
In [ ]:
from notmnist import load_notmnist
X_train, y_train, X_test, y_test = load_notmnist(letters='ABCDEFGHIJ')
X_train, X_test = X_train.reshape([-1, 784]), X_test.reshape([-1, 784])
In [ ]:
%matplotlib inline
plt.figure(figsize=[12, 4])
for i in range(20):
plt.subplot(2, 10, i+1)
plt.imshow(X_train[i].reshape([28, 28]))
plt.title(str(y_train[i]))
In [ ]:
<YOUR CODE: a whole lot of it>
In [ ]:
Recommended pipeline
Add a hidden layer. Now your logistic regression uses hidden neurons instead of inputs.
Now's the time to try improving the network. Consider layers (size, neuron count), nonlinearities, optimization methods, initialization - whatever you want, but please avoid convolutions for now.
If anything seems wrong, try going through one step of training and printing everything you compute.